13. MC Control

So far, you have learned how the agent can take a policy \pi, use it to interact with the environment for many episodes, and then use the results to estimate the action-value function q_\pi with a Q-table.

Then, once the Q-table closely approximates the action-value function q_\pi, the agent can construct the policy \pi' that is \epsilon-greedy with respect to the Q-table, which yields a new policy that is at least as good as (and typically better than) the original policy \pi.
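
As a concrete illustration, here is a minimal sketch of how such an \epsilon-greedy policy can be read off a Q-table for a single state. The function name and arguments (`epsilon_greedy_probs`, `Q_s`, `nA`) are illustrative, not part of the lesson's code.

```python
import numpy as np

def epsilon_greedy_probs(Q_s, epsilon, nA):
    """Return action probabilities that are epsilon-greedy with respect
    to the estimated action values Q_s for a single state."""
    probs = np.ones(nA) * epsilon / nA      # every action gets probability epsilon / nA
    probs[np.argmax(Q_s)] += 1 - epsilon    # the greedy action gets the remaining mass
    return probs
```

Note that the greedy action ends up with probability 1 - \epsilon + \epsilon / nA, since it can also be selected during the uniform exploration step.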

Furthermore, if the agent alternates between these two steps, with:

  • Step 1: using the policy \pi to construct the Q-table, and
  • Step 2: improving the policy by changing it to be \epsilon-greedy with respect to the Q-table (\pi' \leftarrow \epsilon\text{-greedy}(Q), \pi \leftarrow \pi'),

then it will eventually obtain the optimal policy \pi_*.
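
To make the alternation concrete, here is a rough, simplified sketch of a Monte Carlo control loop for a classic OpenAI Gym-style environment (one where `env.reset()` returns an observation and `env.step()` returns `(state, reward, done, info)`). For brevity it re-derives the \epsilon-greedy behavior from the latest Q-table after every episode, which anticipates the incremental updates discussed later in the lesson; all names here are illustrative.

```python
import numpy as np
from collections import defaultdict

def mc_control(env, num_episodes, gamma=1.0, epsilon=0.1):
    """Sketch of Monte Carlo control: alternate policy evaluation (estimate
    the Q-table from sampled episodes) and policy improvement (act
    epsilon-greedily with respect to the current Q-table)."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))               # Q-table
    returns_sum = defaultdict(lambda: np.zeros(nA))     # sum of returns per (state, action)
    returns_count = defaultdict(lambda: np.zeros(nA))   # visit count per (state, action)

    for _ in range(num_episodes):
        # generate an episode with the policy that is epsilon-greedy w.r.t. Q
        episode, state, done = [], env.reset(), False
        while not done:
            probs = np.ones(nA) * epsilon / nA
            probs[np.argmax(Q[state])] += 1 - epsilon
            action = np.random.choice(np.arange(nA), p=probs)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # policy evaluation: first-visit MC update of the Q-table
        visited = set()
        for t, (state, action, _) in enumerate(episode):
            if (state, action) not in visited:
                visited.add((state, action))
                # return G_t: discounted sum of rewards from time step t onward
                G = sum(r * gamma ** i for i, (_, _, r) in enumerate(episode[t:]))
                returns_sum[state][action] += G
                returns_count[state][action] += 1.0
                Q[state][action] = returns_sum[state][action] / returns_count[state][action]

        # policy improvement happens implicitly: the next episode is generated
        # epsilon-greedily with respect to the updated Q-table
    return Q
```

Once training finishes, reading the Q-table greedily (for example, `policy = {s: np.argmax(a) for s, a in Q.items()}`) recovers the estimated optimal policy.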

Since this algorithm is a solution for the control problem (defined below), we call it a Monte Carlo control method.

Control Problem: Estimate the optimal policy.

It is common to refer to Step 1 as policy evaluation, since it is used to determine the action-value function of the policy. Likewise, since Step 2 is used to improve the policy, we also refer to it as a policy improvement step.
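
Written out, this alternation (a standard way of depicting the evaluation/improvement cycle, with E marking policy evaluation and I marking policy improvement) looks like:

\pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_*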

So, using this new terminology, we can summarize what we've learned: our Monte Carlo control method alternates between policy evaluation and policy improvement steps to recover the optimal policy \pi_*.

## The Road Ahead

You now have a working algorithm for Monte Carlo control! So, what's to come?

  • In the next concept (Exploration vs. Exploitation), you will learn more about how to set the value of \epsilon when constructing \epsilon-greedy policies in the policy improvement step.
  • Then, you will learn about two improvements that you can make to the policy evaluation step in your control algorithm.
    • In the Incremental Mean concept, you will learn how to update the policy after every episode (instead of waiting to update the policy until after the values of the Q-table have fully converged from many episodes).
    • In the Constant-alpha concept, you will learn how to train the agent to leverage its most recent experience more effectively.

Finally, to conclude the lesson, you will put your new knowledge into practice by writing your own Monte Carlo control algorithm to solve OpenAI Gym's Blackjack environment!